Genome Research — Latest Matching Preprints

1

Temporal Transcriptomics Identifies Isoform-specific Trans-regulation by Multiple lncRNAs in Human iPSCs

Liu, M.; Mamede, I.; Sofi, S.; Pereira, I.; Dostal, V.; Pashos, A. R. S.; McMahon, C.; Waikar, A.; Stephenson, G.; Cech, T. R.; Rinn, J. L.

2026-05-14 genomics 10.64898/2026.05.13.724994 medRxiv

Top 0.1%

34.5%

Some long non-coding RNAs (lncRNAs) are known to regulate gene expression. However, the underlying temporal dynamics of lncRNAs influencing gene and epigenetic regulation and mechanisms of lncRNA regulation in trans are less understood. To investigate this, we genetically engineered 17 doxycycline-inducible lncRNA transgenes for ectopic expression at the H11 safe harbor locus in human pluripotent stem cells (hiPSCs), and we generated high-density temporal RNA-seq and ATAC-seq profiles. Most lncRNA transgenes were induced at 2 hours and maintained expression through the 96-hour time course. Surprisingly, when we sought to identify gene expression changes due to the lncRNAs, we found that the global transcriptional landscape was dominated by a strong systemic response triggered by doxycycline exposure. We rigorously defined this cohort of genes as a Doxycycline-Responsive Gene Signature (DRGS). The DRGS was also present in at least 28 public datasets from dox-inducible transgene studies involving diverse cell types. Next, we determined which lncRNAs exhibited trans-regulatory events. We identified DANCR, FENDRR, LINC00667, LINC00847, LNCPRESS1, and PNKY as lncRNAs that regulate specific transcript expression in trans. The downstream target genes encoded 53 mRNAs and 10 lncRNAs. None of the target lncRNAs altered gene expression proximal to their own loci (i.e., triggering secondary cis-effects). Surprisingly, the target genes of LINC00847 (transcribed from chromosome 22) were substantially enriched on chromosome 19, with a preponderance of target genes encoding RNA metabolism and RNA splicing factors. Collectively, our study provides a resource to discern artifacts in the doxycycline-inducible system and identifies temporally regulated targets of 6 lncRNAs for future mechanistic studies.

2

Fiber-TEnCATS reveals haplotype-specific chromatin accessibility and DNA methylation at human L1HS loci

Pavlovic, K.; McDonald, T. L.; Diehl, A. G.; Switzenberg, J. A.; Boyle, A. P.

2026-06-28 genomics 10.64898/2026.06.26.734832 medRxiv

Top 0.1%

26.6%

Human-specific long interspersed nuclear element-1 (L1HS) is an active and autonomous retrotransposon in the human genome. Changes in its transcription and transposition are known to affect cellular processes involved in development and aging, and diseases such as neurological disorders and cancer. To better understand natural variability in epigenetic patterns that affect L1HS regulation, we developed a targeted long-read method to simultaneously profile individual haplotypes for DNA methylation and chromatin accessibility across L1HS loci in a healthy human cell line trio. We show that the intronic L1HS in the ZNF638 gene consistently displays high chromatin accessibility and DNA hypomethylation with bidirectional transcription. Our approach also reveals additional intronic and intergenic L1HS copies with allele-specific chromatin accessibility and methylation, and instances of reduced promoter DNA methylation that does not correspond with increased chromatin accessibility. We also identify potential cases of non-Mendelian inheritance of DNA methylation patterns over a subset of L1HS promoters. Our method's high coverage over L1HS loci enables detection and profiling of loci that are missed even by long-read-based assemblies and enables more accurate inheritance tracing of L1HS insertions. Overall, our results offer new insights into the locus-specific regulation of both reference and non-reference L1HS within the human genome.

3

Genomic 8-oxoguanine is associated with transcriptionally active chromatin and elevated gene expression in Plasmodium falciparum

Acharya, D.; Vembar, S. S.

2026-06-25 genomics 10.64898/2026.06.21.733641 medRxiv

Top 0.1%

18.8%

Epigenetic regulation is central to the developmental progression and pathogenicity of the unicellular eukaryotic parasite Plasmodium falciparum; yet, the contribution of DNA base modifications remains poorly understood. One such modification, 8-oxoguanine (8-oxoG), which was initially identified as an oxidative lesion and a marker of DNA damage, has since emerged as a transcriptional regulator in advanced eukaryotes. Given that P. falciparum encounters a highly oxidative environment in human blood, we investigated the potential gene regulatory role of 8-oxoG during its intra-erythrocytic developmental cycle (IDC). Using immunodetection assays, we first confirmed the presence of 8-oxoG in P. falciparum genomic DNA and observed a gradual increase in 8-oxoG abundance from ring to schizont stages. We then optimized oxidative DNA immunoprecipitation sequencing (OxiDIP-seq) for the highly AT-rich parasite genome and generated genome-wide 8-oxoG profiles across four IDC timepoints, which revealed reproducible enrichment of 8-oxoG at discrete genomic loci, with more than 50% of the peaks stable across developmental stages. Notably, 8-oxoG accumulated at putative G-quadruplex-forming sequences in the parasite genome and preferentially localized within exonic regions of protein-coding genes, exhibiting a marked enrichment near STOP codons and within 3' untranslated regions. This in turn correlated with significantly higher steady-state transcript levels of 8-oxoG-marked genes, with stage-specific changes in 8-oxoG enrichment closely matching transcriptional activity. Furthermore, 8-oxoG-marked loci were preferentially associated with active and poised histone post-translational modifications, while showing no evidence of altered nucleosome occupancy. Collectively, these findings demonstrate that 8-oxoG is a widespread and non-random DNA modification in P. falciparum and suggest that it may function as an epigenetic mark associated with transcriptionally permissive chromatin and gene activation during parasite blood-stage development.

4

The Revised Diploid Genome Sequence of an Individual Human: An Optimized Assembly Workflow for Scaling of near Telomere-to-Telomere Assemblies

Lok, S.; Lau, T. N.; Tong, A. H.; Trost, B.; Reuter, M. S.; Thiruvahindrapuram, B.; Paton, T.; MacDonald, J. R.; Lau, L.; Marshall, C. R.; Venter, J. C.; Scherer, S. W.

2026-05-08 genetic and genomic medicine 10.64898/2026.05.01.26352134 medRxiv

Top 0.1%

18.7%

The first draft diploid genome assembly (HuRef) of an individual released in 2007 was a milestone in genomics. Here, we report HuRef2.0, a revision of HuRef assembled using a scalable two-step workflow employing only Oxford Nanopore Technologies (ONT) Simplex reads and the hifiasm assembler. Results are close in continuity to the recent telomere-to-telomere (T2T) assemblies, but were assembled from standard DNA samples without using multiple sequencing and mapping technologies, including ultra-long-reads and/or proximity-ligation. Three ONT flowcells (~103x coverage) from fresh blood DNA produced an assembly comprising 26 contigs, with gapless assembly of 23 chromosomes. Two gaps on chromosome (Chr)Y were locally assembled to yield the final T2T-assembly, HuRef2.0, with base accuracy >Q60 and 2,393 phase blocks with an NG50 value of 2.36 Mb. Assembly from a single ONT flowcell (~35x coverage) consistently produced an assembly more contiguous than GRCh38.p14, providing a foundation for further optimization and scaling. Assembly quality was assessed by direct chromosome-level alignments to the reference genome, variant calling, and the annotation of gene-rich regions at Chr22q11, the extended MHC locus on Chr6, and several difficult-to-assemble regions of the genome, including the ribosomal RNA gene clusters, the sub-telomeric region on Chr4q35, and ChrY. More accurate but shorter Pacific Biosciences (PacBio) HiFi-reads produced less contiguous assemblies than equivalent coverage of error-corrected ONT reads, indicating the importance of read-length. Finally, we compared HuRef2.0 to an assembly of an EBV-transformed lymphoblastoid cell line derived from the same donor. We observed no notable structural differences, indicating that low-passage archival transformed cells are reliable sources for genomic analysis.
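
The abstract above reports an NG50 for phase blocks without defining it. As a reminder of the standard definition, here is a minimal sketch: NG50 is the length of the block at which the cumulative length, taken from longest to shortest, first reaches half of an assumed genome size. The function name `ng50` and the toy numbers are illustrative only and are not taken from the preprint.

```python
def ng50(lengths, genome_size):
    """Return the NG50 of a set of block (or contig) lengths.

    Blocks are visited from longest to shortest; the NG50 is the length of
    the block at which the running total first reaches half of genome_size.
    Returns None if the blocks never cover half of the genome.
    """
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= genome_size / 2:
            return length
    return None


# Toy example with made-up lengths (not data from the preprint).
print(ng50([5, 4, 3, 2, 1], genome_size=20))
```

Unlike N50, which uses the total assembled length as the denominator, NG50 uses the expected genome size, so it stays comparable across assemblies of different completeness.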

5

Single molecule footprinting measures low nucleosome occupancy in mature spermatozoa of mice and men

Gaspa-Toneu, L.; Shi, H.; Ozonov, E. A.; Gill, M. E.; De Geyter, C.; Peters, A. H. F. M.

2026-07-01 genomics 10.64898/2026.06.30.735528 medRxiv

Top 0.1%

18.6%

Nucleosomes are fundamental units of DNA packaging and gene regulation in eukaryotes. In mammalian sperm, most nucleosomes are replaced by protamines, causing extreme chromatin compaction. Various epigenomic studies reported conflicting results on the distribution of residual nucleosomes in mammalian sperm, questioning their potential role in mediating intergenerational inheritance of paternal epigenetic information. Here we performed single-molecule footprinting through Nucleosome Occupancy and Methylome (NOMe) sequencing and applied the Bayesian statistical model nomeR to determine frequencies of nucleosome removal and retention at 103 specific genomic regions in thousands of developing haploid spermatids and mature spermatozoa of mice. While we readily detected footprints of nucleosomes and the transcription factor CTCF in round spermatids, chromatin became transiently highly accessible in elongating spermatids with loss of such footprints, indicating extensive chromatin reprogramming during spermiogenesis. In mature sperm, following nuclear decondensation with recombinant nucleoplasmin, we measured nucleosome occupancy frequencies ranging from ~1.2 to 1.7% at mouse loci. In human sperm, nucleosome occupancy varied between ~2.3 and 4.5% at 163 genomic loci profiled. In contrast to mice, chromatin in ~25% of human sperm was accessible upon reducing disulfide bonds between protamines, arguing for species-specific protamine packaging. Our findings support a stochastic rather than programmed potential role of residual nucleosomes in mammalian sperm in regulating paternal gene expression during ensuing embryonic development.

6

Efficient and accurate near telomere-to-telomere haplotype reconstruction of diploid genomes

Liu, Y.; Yichen, L.; Xu, J.; Tan, Z.; Zhang, W.; Wang, L.; Xu, L.; Zeng, X.; Schoenhuth, A.; Luo, X.

2026-05-24 bioinformatics 10.64898/2026.05.20.726711 medRxiv

Top 0.1%

18.1%

Telomere-to-telomere (T2T) and haplotype-resolved assembly are crucial for understanding eukaryotic genomes. For diploid species, this resolution is critical to uncover allelic variations, inheritance patterns, and functional genomic traits. Current scaffolding methods typically employ either sequence-based or graph-based strategies. Sequence-based approaches rely on proximity signals to yield high contiguity, but underutilize assembly graph information, resulting in more structural errors and chromosomal misassignments. Graph-based methods leverage graph topology for higher accuracy but frequently struggle to achieve chromosome-scale contiguity. However, neither strategy alone can overcome its inherent limitations to simultaneously achieve high contiguity and accuracy. To address these challenges, we introduce HapFold, the first hybrid scaffolding framework that synergistically leverages the complementary strengths of both graph-based and sequence-based approaches. By integrating the topological accuracy of assembly graphs with the proximity-guided contiguity of sequence models, HapFold achieves highly accurate, chromosome-scale or near-T2T haplotype reconstructions for diploid genomes. Compared to existing methods, HapFold achieves superior assembly quality while accelerating computation by an order of magnitude. Furthermore, in the haplotype reconstruction of diploid genomes using standard Oxford Nanopore Technologies simplex reads, HapFold enables the reconstruction of a greater number of near-T2T assemblies. Our approach provides a robust and scalable solution for the high-fidelity reconstruction of haplotype-resolved diploid genomes.

7

Lossless compression of k-mer matrices enabling random row access

Regnier, A.; Lemane, T.; Bellenous, S.; Chikhi, R.; Peterlongo, P.

2026-07-08 bioinformatics 10.64898/2026.07.03.736306 medRxiv

Top 0.1%

18.0%

Genomic search engines such as Logan-Search index petabytes of sequencing data as large binary matrices, called k-mer matrices, where each row encodes the presence of a k-mer across thousands to millions of genomic samples. Logan-Search contains a petabyte of binary matrices, and storing them is expensive, yet compression must not prevent fast random access to any matrix row at query time. We present kmcomp, a lossless compression method for k-mer matrices that satisfies these competing requirements. Block compression partitions the matrix into fixed-size row blocks, each compressed independently; block start positions are stored in an Elias-Fano encoded array, enabling O(1) random access to any block. To improve compressibility without introducing additional decompression steps, we introduce π-compression: a column reordering that groups similar samples together by solving the Traveling Salesman Problem via a nearest-neighbor heuristic. We accelerate this heuristic with a novel variant of the vantage-point tree, the masked vp-tree, which dynamically prunes nearest-neighbor search space. On three (meta)genomic datasets, kmcomp achieves compression ratios of 1.3 to 5.4; π-compression further improves these to 1.5 to 51.3. Applied to the Logan-Search petabyte-scale index, compression reduces storage by approximately half, and π-compression adds a further 13% gain. Query overhead remains modest: queries of hundreds of nucleotides incur an absolute latency increase of ≈100 ms, and highly compressed indexes can match uncompressed query times thanks to reduced disk reads.
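
The two ideas in the abstract above (independently compressed row blocks with an offset index, and a nearest-neighbor column ordering) can be illustrated with a toy sketch. This is not the kmcomp implementation: `zlib` stands in for whatever codec the tool uses, a plain Python list stands in for the Elias-Fano encoded offset array, a brute-force scan stands in for the masked vp-tree, and all function names are hypothetical.

```python
import zlib


def compress_blocks(rows, block_size):
    """Compress fixed-size row blocks independently.

    rows: list of equal-length bytes objects, one per matrix row.
    Returns (blob, offsets), where offsets[b] is the start of block b
    in blob and offsets[-1] is the total length.
    """
    blob = bytearray()
    offsets = [0]  # stand-in for an Elias-Fano encoded array
    for i in range(0, len(rows), block_size):
        blob += zlib.compress(b"".join(rows[i:i + block_size]))
        offsets.append(len(blob))
    return bytes(blob), offsets


def get_row(blob, offsets, block_size, row_len, r):
    """Random access: decompress only the block that contains row r."""
    b = r // block_size
    raw = zlib.decompress(blob[offsets[b]:offsets[b + 1]])
    k = r % block_size
    return raw[k * row_len:(k + 1) * row_len]


def nearest_neighbor_order(columns):
    """Greedy nearest-neighbor tour over columns by Hamming distance.

    columns: list of equal-length sequences of 0/1 values.
    Starts at column 0 and repeatedly appends the closest unvisited column.
    """
    if not columns:
        return []
    order = [0]
    remaining = list(range(1, len(columns)))
    while remaining:
        last = columns[order[-1]]
        nxt = min(
            remaining,
            key=lambda j: sum(a != b for a, b in zip(last, columns[j])),
        )
        remaining.remove(nxt)
        order.append(nxt)
    return order
```

Because each block is compressed on its own, reading one row costs one offset lookup plus one block decompression, regardless of matrix size. The column permutation is applied once before compression and only needs to be stored alongside the index, so it adds no extra decompression step at query time.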

8

Transcriptional Characterization of Nuclear-Integrated Organellar DNA in Populus

Arneson, R.; Wittstock, W.; Marceau, A.; Yuan, Y.

2026-07-12 genomics 10.64898/2026.07.08.737317 medRxiv

Top 0.2%

17.7%

The continuous transfer of organellar DNA into the nuclear genome during eukaryotic evolution has resulted in the widespread occurrence of nuclear plastid DNA insertions (NUPTs) and nuclear mitochondrial DNA insertions (NUMTs). However, their functional significance in nuclear gene expression and genome evolution remains largely unresolved. In this study, we employed Oxford Nanopore Direct RNA Sequencing (DRS) to investigate the transcription of NUPTs and NUMTs in the Populus nuclear genome and compared their transcriptional characteristics with their genome-wide insertion patterns. Our analyses revealed that the majority of transcribed NUPTs and NUMTs are enriched within introns and are co-transcribed with their host or adjacent genes in polycistronic-like transcriptional units. In addition, NUPTs and NUMTs frequently generate intronless transcripts, features reminiscent of their prokaryotic ancestry. We further identified a putatively functional NUPT-derived psbH gene that is unique to P. trichocarpa, providing new insights into the evolution of nuclear-encoded organelle-targeted genes. In addition, we identified transcribed NUPT and NUMT insertion polymorphisms among alleles, suggesting that organellar DNA insertions contribute to allelic variation and may participate in environmental adaptation. Collectively, our findings reveal previously unrecognized roles of NUPT and NUMT transcription in gene regulation, allelic variation, genome evolution, and the emergence of novel genes.

9

DNA 6mA marks transcriptionally active chromatin in malaria parasites

Seshan, D.; Lauer, W.; Sarkar, G.; Govindasamy, M.; Murray, C. S.; Greer, E. L.; Smith, M. L.; Vembar, S. S.

2026-06-13 genomics 10.64898/2026.06.12.732001 medRxiv

Top 0.2%

15.3%

DNA N6-methyladenine (6mA) has emerged as a significant epigenetic modification across a broad range of eukaryotes, from unicellular protists to metazoa. However, its role in unicellular eukaryotic parasites with highly AT-rich genomes, such as malaria-causing Plasmodium falciparum, remains unclear. Using mass spectrometry, South-western blotting, and Single Molecule Real-Time sequencing (Pacific Biosciences) across four stages of P. falciparum intra-erythrocytic development (IED), we confirmed that 0.02-0.04% of genomic adenines are modified to 6mA, with over 60% of the sites being stably maintained during the IED cycle. Notably, 6mA is enriched at transcription start sites, with genes bearing 6mA marks within their 5′ and 3′ untranslated regions exhibiting significantly elevated steady-state transcript levels. Consistent with this, 6mA loci show a strong positive correlation with activating histone post-translational modifications, while showing no significant association with repressive histone marks. Furthermore, in contrast to unicellular ciliates such as Oxytricha and Tetrahymena - organisms that share ancestry with Plasmodium - 6mA-marked genomic regions do not occlude nucleosomes. Lastly, we identified a putative 6mA methyltransferase belonging to the METTL4 family in P. falciparum, PfN6AMT encoded by the PF3D7_1303100 gene, and demonstrate that recombinant PfN6AMT exhibits robust methyltransferase activity in vitro, with mutation of its active site residues abolishing catalytic activity. Collectively, our findings demonstrate that 6mA is a low-abundance, yet reproducible, feature of the P. falciparum epigenome that is associated with transcriptionally active chromatin, and that the molecular mechanisms governing DNA adenine methylation may have undergone substantial evolutionary divergence, even among closely related eukaryotic lineages.

10

Quantifying the contribution of DNA conformational flexibility to transcription factor binding on nucleosomal DNA uncovers indirect readout across diverse TF families

Dey, U.; Martinez, G. S.; Kumar, R.; Yella, V. R.; Kumar, A.

2026-06-06 bioinformatics 10.1101/2025.05.21.655105 medRxiv

Top 0.2%

15.2%

Background: Eukaryotic gene regulation depends on transcription factors (TFs) recognizing short DNA motifs within chromatin. Many of these motifs lie within nucleosomes, where DNA is sharply bent, rotationally phased, and constrained by histone-DNA contacts. Yet only a subset is occupied in any cellular context. Motif identity alone, therefore, cannot fully explain selective TF engagement with nucleosomal DNA. We asked whether sequence-derived DNA conformational flexibility provides an interpretable representation of sequence context relevant to TF recognition on nucleosomes. Results: We compiled five DNA flexibility descriptors in the Python package DNAflexpy, representing bendability, torsional deformation, backbone conformational variability, and stiffness. We built quantitative models of TF binding affinity across 226 datasets from a high-throughput in vitro TF-nucleosome binding assay. Flexibility-augmented models improved prediction over mononucleotide baselines in most datasets, with smaller but reproducible gains over trinucleotide baselines. The gains were not uniform: they varied across TF families and were concordant with DNA shape-fluctuation features, suggesting that DNAflexpy descriptors capture a sequence-encoded structural signal. In PIONEAR-seq data, model performance generalized across nucleosomal templates in a TF- and sequence-dependent manner. Beyond prediction, position-resolved flexibility footprints revealed deformation signatures at cognate motifs and flanking regions across diverse TF families. For SOX11, model-derived footprints aligned with DNA shape fluctuations from nanosecond-to-microsecond molecular dynamics trajectories of SOX11-bound nucleosomes, consistent with independently observed DNA conformational dynamics and bound-state stabilization. The in vivo data showed a similar but more context-dependent pattern. OCT4 occupancy tended to correlate with local flexibility, whereas GATA3-pioneered regions showed flexibility coupled with altered rotational positioning of cognate motifs. Flexibility-augmented classifiers further improved discrimination of occupied nucleosomal motifs across ENCODE datasets. Torsional flexibility features, particularly twist dispersion and trx, were most informative for classification. Conclusions: Sequence-derived DNA conformational flexibility provides a quantitative and interpretable representation of sequence context in TF recognition on nucleosomes. By augmenting sequence with structural information, these models help quantify and interpret an indirect-readout contribution in which DNA deformation tendencies may complement motif sequence and DNA shape. This framework may help explain why only selected motif instances are engaged in chromatin, without treating flexibility as independent of primary sequence.

11

Long read sequencing reveals novel isoforms and spliceosome-mutant-enriched transcripts in AML and MDS

Miller, C. A.; Srivatsan, S. N.; Kramer, M. H.; Ramakrishnan, S. M.; Fronick, C. C.; Fulton, R. S.; Helton, N. M.; Ley, T. J.; Walter, M. J.

2026-05-21 cancer biology 10.64898/2026.05.20.726635 medRxiv

Top 0.2%

15.1%

The alternative splicing landscape of cancer transcriptomes remains poorly characterized, since short read sequencing cannot resolve complete transcript structures. Using the Oxford Nanopore cDNA platform, we generated nearly 2 billion long reads (median 25.8 million per sample) from 71 human samples, including 48 acute myeloid leukemia or myelodysplastic syndrome samples, 25 of which had splicing-factor gene mutations (in SRSF2, U2AF1, or SF3B1). An additional 23 samples were from sorted hematopoietic cell populations from healthy individuals. We identified 174,162 novel isoforms absent from the reference transcriptome, and proteomic validation confirmed that many are translated. We also identified isoforms enriched in spliceosome-mutant samples, and found proteomic evidence of frequent nonsense-mediated decay regulation of novel transcripts. This dataset is a valuable community resource, enabling detection of new transcripts in short read data sets. An interactive portal to explore splicing patterns in these data is available at https://leylab.org/isoforms/.

12

Variation and selection at predicted G-quadruplexes across the human pangenome

Mohanty, S. K.; Marin, M. G.; Smeds, L.; Chiaromonte, F.; Huber, C. D.; Makova, K. D.; Human Pangenome Reference Consortium

2026-06-23 genomics 10.64898/2026.06.18.733261 medRxiv

Top 0.2%

15.0%

G-quadruplexes (G4s), non-canonical DNA structures whose sequence motifs occupy approximately 1% of the human genome, are important for myriad cellular functions, including regulating transcription and replication. Yet they also contribute to genomic instability by increasing mutations and structural variation. Despite their significance, G4 motifs have not been studied in detail across multiple human genomes. Here, we conducted a comprehensive analysis of presence/absence and sequence variation, measured selection strength, and evaluated gene expression regulation potential for predicted G4s (pG4s) across population groups in the second release of the Human Pangenome Reference Consortium dataset, comprising high-quality, near-telomere-to-telomere diploid genomes from 231 individuals worldwide, along with three reference assemblies. Across the human pangenome, we identified over 353 million pG4s, including 1.15 million pG4s absent from reference assemblies but shared across other haplotypes. Our analysis revealed that pG4 sharing patterns recapitulate human population structure: African individuals displayed lower levels of pG4 sharing than non-Africans, whereas East Asian individuals exhibited higher levels of sharing. By analyzing the site frequency spectrum across various genomic annotations, we computed and compared selection coefficients (Sd) at pG4 vs. non-pG4 sites. As expected, the strongest purifying selection (Sd ≥ 10) was detected at protein-coding exons, where pG4 sites had similar or lower selection coefficients compared with those for non-pG4 sites. Strikingly, this pattern reversed at regulatory regions: although purifying selection was weaker overall at promoters, introns, enhancers, and replication origins (1 ≤ Sd < 10), pG4 sites at these regions experienced stronger selection than non-pG4 sites, suggesting that pG4s play functional roles outside coding sequences. Additionally, by integrating pG4 data with long-read transcriptome profiles from this large cohort, we found that pG4s located at promoters and at (or near) exon-intron junctions may influence variation in gene expression levels and transcript isoforms, respectively, across the individuals in the human pangenome. Leveraging extensive population-scale data, our research illuminates the fundamental importance and functional relevance of G4s across human genomes.

13

SnoRNA Expression and RNA 2'-O-Methylation in Drosophila melanogaster S2 Cells

Ye, X.; Liu, Y.; Olson, S.; Zhan, L.; Carmichael, G. G.; Graveley, B.

2026-05-22 genomics 10.64898/2026.05.21.726978 medRxiv

Top 0.2%

14.9%

Small nucleolar RNAs (snoRNAs) are a class of non-coding RNAs that play critical roles in guiding 2'-O-methylation (Nm) and pseudouridylation modifications of RNAs. In Drosophila melanogaster, snoRNAs undergo dynamic changes in expression during development. In this study, we identified 239 snoRNAs that are robustly expressed in Drosophila S2 cells, representing 87% of all annotated Drosophila snoRNAs. Given that box C/D snoRNAs guide site-specific 2'-O-methylation (Nm) of RNA, we next characterized the Nm landscape of S2 cells using RibOxi-seq2, a high-throughput approach capable of detecting Nm modifications with single-nucleotide resolution. RibOxi-seq2 revealed 17 Nm sites in 18S rRNA with a 94% concordance to previously reported RiboMeth-Seq data. In 28S rRNA, 30 Nm sites were identified, corresponding to a 71.4% overlap with established references. Additionally, we detected both a known Nm site (Gm74) and a novel site (Um66) in 5.8S rRNA, further validating the sensitivity and specificity of the approach. RibOxi-seq2 further identified Nm sites in small nuclear RNAs (snRNAs), expanding the annotation of modified non-coding RNAs. Additionally, the method revealed Nm modifications within internal regions of mRNAs. In total, we detected Nm modifications in 2,057 unique mRNAs, underscoring the widespread presence of this epitranscriptomic modification in coding transcripts. Strikingly, although we could not identify any snoRNAs predicted to guide the mRNA 2'-O-methylation modifications by canonical mechanisms, we identified strong consensus sequences surrounding many of these mRNA sites. Together, our findings not only expand the known landscape of Nm-modified RNAs but also highlight the robustness of RibOxi-seq2 for transcriptome-wide RNA modification profiling. Collectively, this study presents a comprehensive atlas of snoRNA expression and 2'-O-methylation sites in Drosophila S2 cells, offering valuable insights into the epitranscriptomic landscape orchestrated by snoRNAs.

14

Evolutionarily labile pachytene piRNAs target an altered set of mRNAs in male hybrids of house mouse subspecies

Saflund, M.; Askari, M.; Eghbali, A.; Abdi, M. M.; Fitzpatrick, J. L.; Yu, T.; Ozata, D. M.

2026-07-08 evolutionary biology 10.64898/2026.07.08.737336 medRxiv

Top 0.2%

14.9%

During male meiosis-I of placental mammals, ~30-nucleotide pachytene PIWI-interacting RNAs (piRNAs) are expressed to regulate genes required for sperm function. Pachytene piRNA genes evolve rapidly. Whether rapid evolutionary turnover of pachytene piRNAs is under positive selective pressure remains enigmatic. Here, we investigate the evolutionary rate of pachytene piRNA genes over a short evolutionary timescale using geographically isolated mouse subspecies. We demarcate the genes producing postnatal piRNAs in PWK/PhJ and CAST/EiJ. Comparative genomics reveals 16 subspecies-specific pachytene piRNA loci, underscoring how labile pachytene piRNA genes are even over a short evolutionary timescale. We report a highly abundant CAST/EiJ-specific pi17-CAST locus, defying the notion that young pachytene piRNA genes do not produce abundant piRNAs. In fact, male hybrids from reciprocal crosses of C57BL/6J and CAST/EiJ produce pi17-CAST piRNAs almost exclusively from the CAST/EiJ allele, suggesting that species-specific nucleotide variants are sufficient to turn a locus into a piRNA source. Intriguingly, hybrid males with reduced fertility features retain distinct piRNA-mRNA pairs compared to parents. Our work reveals that rapidly evolving pachytene piRNAs can gain or lose targets in the hybrid males of closely related mammalian species.

15

mChIP-seq for Multiplex and Multifactorial Epigenomic Profiling Uncovers Cancer-specific Histone Features in Cellular and Circulating Nucleosomes

Sun, C.; Zhang, Q.; Yan, J.; Wang, X.; Zhang, C.; Li, Y.; Li, J.; Xu, W.

2026-04-29 genomics 10.64898/2026.04.27.721226 medRxiv

Top 0.2%

14.9%

Epigenomic profiling enables investigation of the regulatory roles of histone marks in specific cell types, and serves as a critical path for discovering noninvasive epigenetic models in cell-free nucleosomes. Here, we present mChIP-seq, an epigenomic profiling technology that is compatible with both cell and cell-free samples for synchronously profiling multifactorial epigenetic landscapes on multiple samples. Combining sample indexing in a single reaction with a pool-and-split strategy for immunoprecipitation, mChIP-seq enhances efficiency and reduces cost. Using mChIP-seq, we profiled H2A.Z and 10 histone modifications in cell lines representing 9 cancer types. Integrative analyses further revealed an atypical association of H2A.Z and H3K4me3 at promoter regions in cancer. Based on mChIP-seq, we developed cf-mChIP-seq for circulating nucleosomes, which requires as little as 25 µl of plasma per profile. Profiling 38 plasma samples for H2A.Z, H3K4me3, H3K27ac, and H3K9me3 with cf-mChIP-seq revealed distinct histone mark-associated cfDNA fragment patterns in breast cancer versus healthy controls, highlighting the potential of cf-mChIP-seq to expand liquid biopsy methodologies. These results demonstrate that mChIP-seq is a widely applicable technology for large-scale epigenomic profiling of nucleosomes in cellular or cell-free forms.

16

Somatic variant detection in normal tissues from single-cell sequencing data

Luo, R.; Wang, Z.; Dou, J.; Bhamidipati, S. V.; Kalra, D.; Grochowski, C. M.; Doddapaneni, H. V.; Gibbs, R. A.; Chen, K.; Chen, R.

2026-06-14 bioinformatics 10.64898/2026.06.10.731451 medRxiv

Top 0.2%

14.9%

A crucial advantage of single-cell sequencing (SCS) is its ability to identify somatic variants in individual cells, enabling phylogenetic analysis of cellular populations within bulk tissues. While identifying somatic variants in tumor tissues via SCS has become a common practice, doing so in normal tissues remains challenging due to the rarity of somatic variants in normal cells. To evaluate the feasibility of somatic variant calling from widely available single-nucleus RNA-seq (snRNA-seq) and single-nucleus ATAC-seq (snATAC-seq) data, we profiled a cell-line mix of six HapMap samples prepared by the SMaHT consortium using 10x Genomics 5′ snRNA-seq (12k cells with 36k mean reads per cell) and snATAC-seq (11k cells with 14k median high-quality fragments per cell) for variant calling. PacBio long-read whole genome sequencing (WGS) data (109x) generated from individual cell lines were used as ground truth. Two computational tools, Monopogen and SComatic, were used for somatic variant calling from the SCS data. Monopogen achieved single nucleotide variant (SNV) detection accuracies of 93.30% in the snRNA-seq and 99.64% in the snATAC-seq data, both of which outperformed SComatic (74.35% and 94.29%, respectively). Monopogen also consistently detected somatic SNVs at cellular fractions as low as 0.5% (2.54% in snRNA and 0.81% in snATAC) in individual samples. Notably, snATAC-seq exhibited higher genomic coverage breadth and a larger number of detected variants than snRNA-seq. While the SCS data have lower overall genome coverage than the bulk WGS, the single-cell level variant resolution allows Monopogen to assign variants to their cells of origin with over 80% accuracy in both RNA and ATAC modalities, thereby facilitating studies of clonal evolution and cell-type-specific mutagenesis. Other methods (DeepVariant, Cellsnp-lite and Mutect2) were also benchmarked for comparison. In conclusion, our study demonstrated the feasibility of performing reliable single-cell somatic mutation calling in a cell-line mixture and discussed the strengths and limitations of current computational methods when applied to normal tissues.

17

kmerRRR: A k-mer based tool for functional genomics in Repeat Rich Regions

Rahmat, J.; Pham, T. M.; Larracuente, A. M.

2026-06-25 genomics 10.64898/2026.06.21.732238 medRxiv

Top 0.2%

14.8%

Highly repetitive sequences pose problems for genome assembly and analysis. While advances in long-read sequencing technologies have helped reveal the organization of repetitive genomic sequences at unprecedented resolution, their functional characterization remains difficult because molecular assays that probe protein-DNA interactions and characterize expression often rely on short-read sequencing. The repetitive nature of these regions poses major challenges for methods relying on sequence mapping, and the problem is exacerbated for short reads. Repetitive genome regions often have low mappability, leading to substantial information loss during downstream filtering. To address this challenge, we developed a bioinformatic tool, kmerRRR, that leverages k-mer frequency analyses to enhance the mappability of repetitive regions. kmerRRR compares k-mer frequencies within user-defined loci to their frequencies across the genome to identify repetitive sequences that are overrepresented locally relative to the global background. This approach quantifies locus uniqueness, allowing users to distinguish sequences that are globally repetitive from those that are repetitive but restricted to specific genomic loci. We demonstrated the utility of this method by reanalyzing chromatin profiling data from human, Drosophila, and Arabidopsis centromeres, as well as small RNA sequencing data. Our results show that incorporating local k-mer ratio information enhances read retention and signal interpretation within repetitive regions, thereby recovering biologically meaningful information that is typically lost in conventional analyses. The tool is freely available under the MIT license on GitHub (https://github.com/LarracuenteLab/kmerRRR).
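
The local-versus-global comparison described in this abstract can be illustrated with a toy calculation. The sketch below is not kmerRRR's implementation or interface: the function names, the plain-string genome, and the choice to ignore reverse complements and ambiguous bases are all assumptions made for illustration. It computes, for each k-mer in a locus, the fraction of its genome-wide occurrences that fall inside that locus.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq (strand and N bases are ignored)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def local_kmer_ratios(genome, locus_start, locus_end, k):
    """Hypothetical helper: for each k-mer inside genome[locus_start:locus_end],
    return (count inside the locus) / (count across the whole genome).

    A ratio near 1.0 means the k-mer is repetitive but confined to the locus;
    a low ratio means the k-mer is also common elsewhere in the genome.
    """
    global_counts = count_kmers(genome, k)
    local_counts = count_kmers(genome[locus_start:locus_end], k)
    # Every local k-mer also occurs in the genome, so the divisor is never zero.
    return {kmer: n / global_counts[kmer] for kmer, n in local_counts.items()}


# Toy genome: an AC repeat, a locus-specific GGTT repeat, then more AC repeat.
toy_genome = "ACACACAC" + "GGTTGGTT" + "ACAC"
ggtt_locus = local_kmer_ratios(toy_genome, 8, 16, k=4)
ac_locus = local_kmer_ratios(toy_genome, 0, 8, k=4)
```

In this toy genome, `GGTT` occurs only inside the second locus, so its ratio is 1.0, while `ACAC` also occurs outside the first locus, so its ratio there is below 1.0.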

18

Carbon: Decoding the Language of Life

Allal, L. B.; Li, Q.; Fiusco, M.; Tunstall, L.; Rasul, K.; Beeching, E.; Aubakirova, D.; Patino, C.; Frere, T.; Lozhkov, A.; Channing, G.; Wolf, T.; Bernardo, D. d.; Werra, L. v.

2026-05-25 genomics 10.64898/2026.05.22.727119 medRxiv

Top 0.2%

14.6%

Genomic foundation models have emerged alongside the rapid progress of large language models, offering a promising framework for learning general-purpose sequence priors for DNA understanding, generation, and design. This connection to LLMs creates a major opportunity: modern architectures, scaling infrastructure, autoregressive training, and token-based modeling provide powerful tools for genomic sequence modeling. At the same time, DNA differs fundamentally from natural language. Genomic sequences are noisy, redundant, sparsely constrained, unevenly annotated, and shaped by evolutionary rather than communicative pressures. As a result, key components of the standard LLM recipe, including data construction, tokenization, and training objectives, must be reconsidered in the biological sequence setting. A central challenge in DNA modeling is reconciling single-nucleotide resolution with long-context reasoning. Single-nucleotide resolution is essential for variant effect prediction, splice-site analysis, and codon-level reasoning. Long-context modeling is equally important, as many genomic mechanisms depend on distal regulatory elements, gene neighborhoods, and long-range evolutionary constraints. However, the most direct path to nucleotide-level reasoning, single-nucleotide tokenization, makes genomic sequences extremely long and imposes substantial computational cost on Transformer models. We present Carbon, a family of efficient generative DNA language models designed as a practical reference point for this setting. Carbon includes 3B- and 8B-parameter decoder-only autoregressive models using non-overlapping 6-mer tokenization. Carbon-3B supports a maximum context length of 65,536 tokens, corresponding to approximately 393 kbp of DNA; Carbon-8B supports up to 131,072 tokens, roughly 786 kbp. This simple and controlled setup helps isolate a central question for DNA language modeling: whether current progress is limited primarily by model architecture and nominal context length, or by more basic alignment between data, tokenization, objectives, evaluation, and the biological structure of genomic sequence. In our training-free evaluation suite, Carbon-3B is competitive with Evo2-7B despite having less than half the parameters. Carbon-8B improves on Carbon-3B on every training-free task, with the largest gain on long-context retrieval. Both models deliver tens-fold faster inference under comparable settings. The Carbon recipe combines annotation-aware data curation, deterministic 6-mer tokenization, and a staged CE-to-FNS objective schedule, adapting the LLM recipe to the statistical and biological properties of DNA rather than directly transplanting it. We release the models, data, training code, and evaluation suite, including new training-free probes for sequence-level perturbation and DNA long-context retrieval. Carbon is intended as an open recipe for efficient generative DNA modeling rather than an argument for any specific architecture, tokenization strategy, or objective design as the optimal solution. Its strong performance provides grounded evidence that substantial room remains for domain-aware model design carefully aligned with the genomic sequence itself.
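
The context lengths quoted in this abstract follow directly from non-overlapping 6-mer tokenization: each token covers six bases, so a window of N tokens spans 6N bp. The sketch below illustrates that arithmetic. It is not the released Carbon tokenizer; the function names are hypothetical, and keeping a short trailing token when the sequence length is not a multiple of six is an assumption.

```python
def tokenize_kmers(seq, k=6):
    """Split seq into consecutive, non-overlapping k-mers.

    Assumption for illustration: leftover bases at the end (fewer than k)
    are kept as one shorter final token rather than padded or dropped.
    """
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def context_span_bp(max_tokens, k=6):
    """Number of bases covered by max_tokens non-overlapping k-mer tokens."""
    return max_tokens * k


tokens = tokenize_kmers("ACGTACGTACGTAC")   # 14 bases -> two 6-mers plus "AC"
span_3b = context_span_bp(65_536)           # 393,216 bp, about 393 kbp
span_8b = context_span_bp(131_072)          # 786,432 bp, about 786 kbp
```

The same relation shows the trade-off the abstract describes: single-nucleotide tokenization (k = 1) would need six times as many tokens to cover the same span.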

19

TDKC (Target Distilled K-mer Classifier): Ultrafast and Memory-Efficient Sequence Classification for Target Pathogen Diagnostics

Lee, S.; Agarwal, V.; O'Brien, W.; Eskin, E.

2026-06-06 bioinformatics 10.64898/2026.06.05.730319 medRxiv

Top 0.2%

14.6%

Metagenomic sequencing can identify pathogens from clinical samples without prior knowledge of the causative agent. Yet, as sequencing workflows scale to process thousands of multiplexed samples simultaneously, classifying these samples against massive reference databases creates a significant computational bottleneck. Furthermore, large-scale applications such as screening public sequence repositories remain computationally challenging. Existing metagenomic classifiers are designed for full-taxon classification, where the goal is to identify all organisms in a sample. However, many diagnostic applications focus on detecting a specific set of clinically relevant pathogens. This constraint can be exploited to significantly lower computational costs. Here we present TDKC (Target Distilled K-mer Classifier), a method for targeted metagenomic classification. TDKC constructs a compact index by distilling target-specific k-mers from a full-taxon reference database. When classifying clinical samples, TDKC uses 16.9-33.6x less memory and is 5.2-34.3x faster than per-read full-taxon and targeted classifiers (Kraken2, Centrifuger, CLARK), while maintaining high sensitivity and low false positive rates. Against the sketch-based profiler Sylph, TDKC remains 4.2x faster and uses 8.5x less memory. TDKC also supports per-k-mer accession tracking across over 3 million source accessions for downstream subtype analysis, and domain-level detection of bacteria, archaea, and viruses. By reducing the index to only the pathogens of interest, TDKC makes targeted pathogen detection feasible at scale.
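
The idea of distilling target-specific k-mers can be sketched as a set difference: keep only k-mers that occur in a target genome and in no other reference, then classify reads by counting hits against that small index. The code below is a minimal illustration under that assumption, not TDKC's algorithm or data structures; the function names, the `min_hits` threshold, and the toy sequences are hypothetical, and canonical k-mers, minimizers, and accession tracking are omitted.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def distill_index(references, targets, k):
    """Map each k-mer unique to exactly one target to that target's name.

    references: dict of name -> sequence (the full reference database)
    targets:    set of names within references that we want to detect
    """
    background = set()
    for name, seq in references.items():
        if name not in targets:
            background |= kmers(seq, k)
    index, shared = {}, set()
    for name in sorted(targets):
        for kmer in kmers(references[name], k) - background:
            if kmer in index:
                shared.add(kmer)      # seen in more than one target
            index[kmer] = name
    for kmer in shared:
        del index[kmer]
    return index


def classify_read(read, index, k, min_hits=2):
    """Return the target with the most k-mer hits, or None below min_hits."""
    hits = Counter(index[kmer] for kmer in kmers(read, k) if kmer in index)
    if not hits:
        return None
    name, n = hits.most_common(1)[0]
    return name if n >= min_hits else None


# Toy database: two sequences sharing a prefix; only pathogen_A is a target.
refs = {"pathogen_A": "ACGTTGCAAGGCT", "commensal_B": "ACGTTGCATTTTT"}
idx = distill_index(refs, {"pathogen_A"}, k=5)
call_specific = classify_read("GCAAGGCT", idx, k=5)   # read from the A-only region
call_shared = classify_read("ACGTTGCA", idx, k=5)     # read from the shared prefix
```

Because the index holds only k-mers that separate targets from everything else, its size scales with the target set rather than with the full database, which is the source of the memory savings the abstract reports.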

20

NanoLabel: A fast and accurate real-time nanopore signal classifier

Mahajan, D.; Jain, C.; Kashyap, N.

2026-05-06 genomics 10.64898/2026.05.03.722500 medRxiv

Top 0.2%

13.4%

Oxford Nanopore Technologies' adaptive sampling capability promises to reduce sequencing cost and turnaround time. At its core, adaptive sampling is a real-time classification problem that distinguishes reads originating from regions of interest. Direct signal-based classification approaches bypass the computational bottleneck of basecalling and can eliminate the need for powerful GPUs. However, operating directly on noisy raw signals remains challenging in real-time settings, where classification decisions must be made quickly. In this work, we propose NanoLabel, a new method for real-time classification of nanopore signals. We build NanoLabel on top of the signal-based read-mapping tool RawHash2. We accelerate the classification workflow by mapping reads using only the target regions as the reference. To further improve accuracy, we train a lightweight classifier on mapping-derived features and introduce a data augmentation strategy to construct sufficiently large and class-balanced training datasets. We evaluate NanoLabel using publicly available real sequencing datasets from three human genomes (HG001, HG002, and HG005), assuming a cancer gene panel as the target. Compared to directly mapping reads with RawHash2, we demonstrate an 80x improvement in classification time and an improvement of 0.10-0.25 in F1 score.